AITopics | single-cell rna-seq data

Collaborating Authors

single-cell rna-seq data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

Neural Information Processing SystemsDec-26-2025, 22:44:36 GMT

Advances in high-throughput sequencing technology have led to significant progress in measuring gene expressions at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data is already surpassing 50M records for humans with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitive to train on such data in terms of both computation and memory. To address this challenge, we propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene$^\alpha$ (or xTrimoGene for short), which leverages the sparse characteristic of the data to scale up the pre-training. This scalable design of xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today. Our experiments also show that the performance of xTrimoGene improves as we scale up the model sizes, and it also leads to SOTA performance over various downstream tasks, such as cell type annotation, perturb-seq effect prediction, and drug combination prediction.

efficient and scalable representation learner, name change, xtrimogene, (4 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.59)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.76)

Add feedback

scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data

Xu, Ping, Ning, Zhiyuan, Li, Pengjiang, Liu, Wenhao, Wang, Pengyang, Cui, Jiaxu, Zhou, Yuanchun, Wang, Pengfei

arXiv.org Artificial IntelligenceOct-3-2025

Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural networks (GNNs)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising of 3 key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance. Comprehensive evaluations on seven real-world datasets demonstrate that scSiameseClu outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.

data mining, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.12626

Country: Asia > China (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(2 more...)

Add feedback

Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data

Chardès, Victor

arXiv.org Artificial IntelligenceSep-22-2025

Single-cell RNA-seq provides detailed molecular snapshots of individual cells but is notoriously noisy. Variability stems from biological differences, PCR amplification bias, limited sequencing depth, and low capture efficiency, making it challenging to adapt computational pipelines to heterogeneous datasets or evolving technologies. As a result, most studies still rely on principal component analysis (PCA) for dimensionality reduction, valued for its interpretability and robustness. Here, we improve upon PCA with a Random Matrix Theory (RMT)-based approach that guides the inference of sparse principal components using existing sparse PCA algorithms. We first introduce a novel biwhitening method, inspired by the Sinkhorn-Knopp algorithm, that simultaneously stabilizes variance across genes and cells. This enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free. Our mathematically grounded approach retains the interpretability of PCA while enabling robust, hands-off inference of sparse principal components. Across seven single-cell RNA-seq technologies and four sparse PCA algorithms, we show that this method systematically improves the reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification tasks.

artificial intelligence, eigenvalue, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2509.15429

Country: North America > United States (0.28)

Genre:

Research Report (1.00)
Workflow (0.68)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

JojoSCL: Shrinkage Contrastive Learning for single-cell RNA sequence Clustering

Wang, Ziwen

arXiv.org Machine LearningJun-3-2025

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular processes by enabling gene expression analysis at the individual cell level. Clustering allows for the identification of cell types and the further discovery of intrinsic patterns in single-cell data. However, the high dimensionality and sparsity of scRNA-seq data continue to challenge existing clustering models. In this paper, we introduce JojoSCL, a novel self-supervised contrastive learning framework for scRNA-seq clustering. By incorporating a shrinkage estimator based on hierarchical Bayesian estimation, which adjusts gene expression estimates towards more reliable cluster centroids to reduce intra-cluster dispersion, and optimized using Stein's Unbiased Risk Estimate (SURE), JojoSCL refines both instance-level and cluster-level contrastive learning. Experiments on ten scRNA-seq datasets substantiate that JojoSCL consistently outperforms prevalent clustering methods, with further validation of its practicality through robustness analysis and ablation studies. JojoSCL's code is available at: https://github.com/ziwenwang28/JojoSCL.

artificial intelligence, bayesian inference, machine learning, (17 more...)

arXiv.org Machine Learning

2506.0041

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.88)

Add feedback

Celler:A Genomic Language Model for Long-Tailed Single-Cell Annotation

Zhao, Huan, Liu, Yiming, Yao, Jina, Xiong, Ling, Zhou, Zexin, Zhang, Zixing

arXiv.org Artificial IntelligenceMar-27-2025

Recent breakthroughs in single-cell technology have ushered in unparalleled opportunities to decode the molecular intricacy of intricate biological systems, especially those linked to diseases unique to humans. However, these progressions have also ushered in novel obstacles-specifically, the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To effectively surmount this challenge, we introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data. Celler incorporates two groundbreaking elements: First, we introduced the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model's ability to learn from rare categories while reducing the risk of overfitting for common categories. Secondly, we introduce an innovative Hard Data Mining (HDM) strategy into the training process, specifically targeting the challenging-to-learn minority data samples, which significantly improved the model's predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset: Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research. Our code is available at https://github.com/AI4science-ym/HiCeller.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.0002

Country:

Asia > China (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
Europe > Netherlands > South Holland > Leiden (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.68)
Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

Neural Information Processing SystemsJan-20-2025, 00:04:29 GMT

Advances in high-throughput sequencing technology have led to significant progress in measuring gene expressions at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data is already surpassing 50M records for humans with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitive to train on such data in terms of both computation and memory. To address this challenge, we propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene \alpha (or xTrimoGene for short), which leverages the sparse characteristic of the data to scale up the pre-training. This scalable design of xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today.

efficient and scalable representation learner, single-cell rna-seq data, xtrimogene, (2 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.42)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.80)

Add feedback

Uncertainty-aware t-distributed Stochastic Neighbor Embedding for Single-cell RNA-seq Data

Ma, Hui, Chen, Kai

arXiv.org Artificial IntelligenceOct-1-2024

Nonlinear data visualization using t-distributed stochastic neighbor embedding (t-SNE) enables the representation of complex single-cell transcriptomic landscapes in two or three dimensions to depict biological populations accurately. However, t-SNE often fails to account for uncertainties in the original dataset, leading to misleading visualizations where cell subsets with noise appear indistinguishable. To address these challenges, we introduce uncertainty-aware t-SNE (Ut-SNE), a noise-defending visualization tool tailored for uncertain single-cell RNA-seq data. By creating a probabilistic representation for each sample, Our Ut-SNE accurately incorporates noise about transcriptomic variability into the visual interpretation of single-cell RNA sequencing data, revealing significant uncertainties in transcriptomic variability. Through various examples, we showcase the practical value of Ut-SNE and underscore the significance of incorporating uncertainty awareness into data visualization practices. This versatile uncertainty-aware visualization tool can be easily adapted to other scientific domains beyond single-cell RNA sequencing, making them valuable resources for high-dimensional data analysis.

dataset, ut-sne, visualization, (11 more...)

arXiv.org Artificial Intelligence

2410.00473

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.94)

Technology:

Information Technology > Visualization (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling

Dong, Shengze, Cui, Zhuorui, Liu, Ding, Lei, Jinzhi

arXiv.org Artificial IntelligenceApr-9-2024

Motivation: Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research, facilitating the examination of gene expression at the individual cell level within a given tissue sample. While numerous tools have been developed for scRNA-seq data analysis, the challenge persists in capturing the distinct features of such data and replicating virtual datasets that share analogous statistical properties. Results: Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT). This method generates virtual scRNA-seq data by leveraging a real dataset. The method is a neural network constructed based on Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs). This involves subjecting Gaussian noises to the real dataset through iterative noise-adding steps and ultimately restoring the noises to form scRNA-seq samples. This scheme allows us to learn data features from actual scRNA-seq samples during model training. Our experiments, conducted on two distinct scRNA-seq datasets, demonstrate superior performance. Additionally, the model sampling process is expedited by incorporating Denoising Diffusion Implicit Models (DDIM). scRDiT presents a unified methodology empowering users to train neural network models with their unique scRNA-seq datasets, enabling the generation of numerous high-quality scRNA-seq samples. Availability and implementation: https://github.com/DongShengze/scRDiT

diffusion model, scrna-seq data, scrna-seq sample, (15 more...)

arXiv.org Artificial Intelligence

2404.06153

Country:

Asia > China > Tianjin Province > Tianjin (0.05)
North America > United States (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types

Hu, Dayu, Liang, Ke, Yu, Hao, Liu, Xinwang

arXiv.org Artificial IntelligenceDec-15-2023

In recent years, the field of single-cell data analysis has seen a marked advancement in the development of clustering methods. Despite advancements, most of these algorithms still concentrate on analyzing the provided single-cell matrix data. However, in medical applications, single-cell data often involves a wealth of exogenous information, including gene networks. Overlooking this aspect could lead to information loss and clustering results devoid of significant clinical relevance. An innovative single-cell deep clustering method, incorporating exogenous gene information, has been proposed to overcome this limitation. This model leverages exogenous gene network information to facilitate the clustering process, generating discriminative representations. Specifically, we have developed an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells. Concurrently, we conducted a random walk on an exogenous Protein-Protein Interaction (PPI) network, thereby acquiring the gene's topological features. Ultimately, during the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation. Extensive experiments have validated the effectiveness of our proposed method. This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.

graph, information, matrix, (16 more...)

arXiv.org Artificial Intelligence

2311.17104

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > Pennsylvania (0.04)
Asia > Mongolia (0.04)
(3 more...)

Genre:

Research Report > Promising Solution (0.64)
Overview > Innovation (0.50)
Research Report > New Finding (0.47)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Subject clustering by IF-PCA and several recent methods

Chen, Dieyi, Jin, Jiashun, Ke, Zheng Tracy

arXiv.org Artificial IntelligenceJun-8-2023

Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of great interest. In recent years, many approaches were proposed, among which unsupervised deep learning (UDL) has received a great deal of attention. Two interesting questions are (a) how to combine the strengths of UDL and other approaches, and (b) how these approaches compare to one other. We combine Variational Auto-Encoder (VAE), a popular UDL approach, with the recent idea of Influential Feature PCA (IF-PCA), and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on $10$ gene microarray data sets and $8$ single-cell RNA-seq data sets. We find that IF-VAE significantly improves over VAE, but still underperforms IF-PCA. We also find that IF-PCA is quite competitive, which slightly outperforms Seurat and SC3 over the $8$ single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving the phase transition in a Rare/Weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).

bioinformatics, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2306.05363

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
(2 more...)

Add feedback